Setup

Set the working directory, clear all existing objects in the workspace and set the seed for reproducibility. Lastly, load the necessary libraries.

# Set the working directory
setwd("./")

# Clear all existing objects in the workspace
rm(list = ls())

# Set the seed for reproducible results
set.seed(1009)

# Load libraries
library(dplyr)
library(ggplot2)
library(ggmap)
library(scales)
library(RColorBrewer)

Attribute Definitions:

Load the Datasets

train.values <- read.csv("../data/clean-training-set-values.csv", header = TRUE, na.strings = "NA")
train.values$population[train.values$population == 0] <- NA
train.labels <- read.csv("../data/src-training-set-labels.csv", header = TRUE)

Analysis

For the analysis, the training values and labels will be merged, and the exploration will be conducted on the combined dataset. ## Merge the training values and labels

training.data <- merge(train.values, train.labels)

View the Summary Statistics

The following variables contain missing values:

  • funder - 4,414
  • installer - 4,435
  • wpt_name - 3,565
  • subvillage - 371
  • scheme_management - 3,877
  • scheme_name - 28,166
  • construction_year - 20,709
summary(training.data)
##        id          amount_tsh          date_recorded                      funder     
##  Min.   :    0   Min.   :     0.0   2011-03-15:  572   Government Of Tanzania: 9084  
##  1st Qu.:18520   1st Qu.:     0.0   2011-03-17:  558   Danida                : 3114  
##  Median :37062   Median :     0.0   2013-02-03:  546   Hesawa                : 2202  
##  Mean   :37115   Mean   :   317.7   2011-03-14:  520   Rwssp                 : 1374  
##  3rd Qu.:55656   3rd Qu.:    20.0   2011-03-16:  513   World Bank            : 1349  
##  Max.   :74247   Max.   :350000.0   2011-03-18:  497   (Other)               :37863  
##                                     (Other)   :56194   NA's                  : 4414  
##    gps_height          installer       longitude        latitude            wpt_name    
##  Min.   : -90.0   DWE       :17402   Min.   : 0.00   Min.   :-11.649   Shuleni  : 1748  
##  1st Qu.:   0.0   Government: 1825   1st Qu.:33.09   1st Qu.: -8.541   Zahanati :  830  
##  Median : 369.0   RWE       : 1206   Median :34.91   Median : -5.022   Msikitini:  535  
##  Mean   : 668.3   Commu     : 1060   Mean   :34.08   Mean   : -5.706   Kanisani :  323  
##  3rd Qu.:1319.2   DANIDA    : 1050   3rd Qu.:37.18   3rd Qu.: -3.326   Bombani  :  271  
##  Max.   :2770.0   (Other)   :32422   Max.   :40.35   Max.   :  0.000   (Other)  :52128  
##                   NA's      : 4435                                     NA's     : 3565  
##   num_private                    basin          subvillage            region     
##  Min.   :   0.0000   Lake Victoria  :10248   Madukani:  508   Iringa     : 5294  
##  1st Qu.:   0.0000   Pangani        : 8940   Shuleni :  506   Shinyanga  : 4982  
##  Median :   0.0000   Rufiji         : 7976   Majengo :  502   Mbeya      : 4639  
##  Mean   :   0.4741   Internal       : 7785   Kati    :  373   Kilimanjaro: 4379  
##  3rd Qu.:   0.0000   Lake Tanganyika: 6432   Mtakuja :  262   Morogoro   : 4006  
##  Max.   :1776.0000   Wami / Ruvu    : 5987   (Other) :56878   Arusha     : 3350  
##                      (Other)        :12032   NA's    :  371   (Other)    :32750  
##   region_code   district_code             lga               ward         population     
##  Min.   : 1.0   Min.   : 0.00   Njombe      : 2503   Igosi    :  307   Min.   :    1.0  
##  1st Qu.: 5.0   1st Qu.: 2.00   Arusha Rural: 1252   Imalinyi :  252   1st Qu.:   40.0  
##  Median :12.0   Median : 3.00   Moshi Rural : 1251   Siha Kati:  232   Median :  150.0  
##  Mean   :15.3   Mean   : 5.63   Bariadi     : 1177   Mdandu   :  231   Mean   :  281.1  
##  3rd Qu.:17.0   3rd Qu.: 5.00   Rungwe      : 1106   Nduruma  :  217   3rd Qu.:  324.0  
##  Max.   :99.0   Max.   :80.00   Kilosa      : 1094   Kitunda  :  203   Max.   :30500.0  
##                                 (Other)     :51017   (Other)  :57958   NA's   :21381    
##  public_meeting                  recorded_by          scheme_management        scheme_name   
##  False: 5055    GeoData Consultants Ltd:59400   VWC            :36793   K            :  682  
##  True :51011                                    WUG            : 5206   None         :  644  
##  NA's : 3334                                    Water authority: 3153   Borehole     :  546  
##                                                 WUA            : 2883   Chalinze wate:  405  
##                                                 Water Board    : 2748   M            :  400  
##                                                 (Other)        : 4740   (Other)      :28557  
##                                                 NA's           : 3877   NA's         :28166  
##    permit      construction_year    extraction_type  extraction_type_group
##  False:17492   Min.   :1960      gravity    :26780   gravity    :26780    
##  True :38852   1st Qu.:1987      nira/tanira: 8154   nira/tanira: 8154    
##  NA's : 3056   Median :2000      other      : 6430   other      : 6430    
##                Mean   :1997      submersible: 4764   submersible: 6179    
##                3rd Qu.:2008      swn 80     : 3670   swn 80     : 3670    
##                Max.   :2013      mono       : 2865   mono       : 2865    
##                NA's   :20709     (Other)    : 6737   (Other)    : 5322    
##   extraction_type_class            management      management_group
##  gravity     :26780     vwc             :40507   commercial: 3638  
##  handpump    :16456     wug             : 6515   other     :  943  
##  motorpump   : 2987     water board     : 2933   parastatal: 1768  
##  other       : 6430     wua             : 2535   unknown   :  561  
##  rope pump   :  451     private operator: 1971   user-group:52490  
##  submersible : 6179     parastatal      : 1768                     
##  wind-powered:  117     (Other)         : 3171                     
##                   payment          payment_type           water_quality    quality_group  
##  never pay            :25348   annually  : 3642   soft           :50818   colored :  490  
##  other                : 1054   monthly   : 8300   salty          : 4856   fluoride:  217  
##  pay annually         : 3642   never pay :25348   unknown        : 1876   good    :50818  
##  pay monthly          : 8300   on failure: 3914   milky          :  804   milky   :  804  
##  pay per bucket       : 8985   other     : 1054   coloured       :  490   salty   : 5195  
##  pay when scheme fails: 3914   per bucket: 8985   salty abandoned:  339   unknown : 1876  
##  unknown              : 8157   unknown   : 8157   (Other)        :  217                   
##          quantity          quantity_group                   source     
##  dry         : 6246   dry         : 6246   spring              :17021  
##  enough      :33186   enough      :33186   shallow well        :16824  
##  insufficient:15129   insufficient:15129   machine dbh         :11075  
##  seasonal    : 4050   seasonal    : 4050   river               : 9612  
##  unknown     :  789   unknown     :  789   rainwater harvesting: 2295  
##                                            hand dtw            :  874  
##                                            (Other)             : 1699  
##                source_type         source_class                      waterpoint_type 
##  borehole            :11949   groundwater:45794   cattle trough              :  116  
##  dam                 :  656   surface    :13328   communal standpipe         :28522  
##  other               :  278   unknown    :  278   communal standpipe multiple: 6103  
##  rainwater harvesting: 2295                       dam                        :    7  
##  river/lake          :10377                       hand pump                  :17488  
##  shallow well        :16824                       improved spring            :  784  
##  spring              :17021                       other                      : 6380  
##         waterpoint_type_group                  status_group  
##  cattle trough     :  116     functional             :32259  
##  communal standpipe:34625     functional needs repair: 4317  
##  dam               :    7     non functional         :22824  
##  hand pump         :17488                                    
##  improved spring   :  784                                    
##  other             : 6380                                    
## 

Visualize Pumps in Tanzania

To begin, pump locations are plotted on a map to get a sense of where they are located and determine if they are clustered in certain parts of the country. The pumps, regardless of status, are generally located in the same areas. Although, the Southern tip of Tanzania shows a higher concentration of non-functional pumps.

# Center of Tanzania
map <- get_map(c(lon = 34.888822, lat = -6.369028), source = "google", zoom = 6)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=-6.369028,34.888822&zoom=6&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
ggmap(map) + 
  geom_point(aes(x = longitude, y = latitude, colour = status_group), data = training.data, alpha = 1/2) +
  scale_colour_brewer(palette = "Set1", name = "Status Group")

Pump Status Distribution

An inspection of the pump status group shows that just over half of the water pumps are operational at 54.3%.

ggplot(data=training.data, aes(x=status_group)) + 
    geom_bar(aes(y = (..count..)/sum(..count..)), fill=brewer.pal(3, "Set1"), alpha = 1/2) + 
    geom_text(aes(y = ((..count..)/sum(..count..)), label = scales::percent((..count..)/sum(..count..))), 
              stat = "count", vjust = -0.25) +
    scale_y_continuous(labels = percent) +
    theme(axis.text.y=element_blank(), axis.ticks=element_blank(), axis.title.y=element_blank()) +
    theme(plot.title = element_text(hjust = 0.5, face="bold")) +
    xlab("Operational Status") + ylab("Percent")

Inspect Categorical Variables

The next few plots compare categorical variables by pump status groups.

Basins

Ploting the water pumps by basin and separating them by status shows that the Pangani basin has the most functional water pumps. The Lake Victoria basin has the highest quantity of water pumps that require repair or are not functional. In addition, Lake Rukwa, Lake Tanganyika, and Ruvuma / Southern Coast have more non-functional pumps than functional pumps.

ggplot(training.data, aes(x=basin, fill = basin)) + geom_bar(alpha = 1/2) + 
  scale_fill_brewer(palette="Set1") + facet_grid(~status_group) +
  theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
  xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Basin"))

Regions

Comparing the region plot to the region map illustrates that regions where the quantity of non-functional pumps exceed functional pumps are not clustered in a specific part of the country. These regions include Lindi, Mara, Mtwara, Rukwa, and Tabora.

ggplot(training.data, aes(x=region, fill = status_group)) + 
  geom_bar(alpha = 1/2) + scale_fill_brewer(palette="Set1") +
  theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
  xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Status Group"))

Tanzania Region Map Figure 1. Tanzania region map. Reprinted from Tanzania regions (para. 1).

Permit

A potential hypothesis is that pumps without permits may not be built to the required standards, and therefore may have a higher amount of pumps needing repair or that are non-funtional. However, at the following plot depicts, for each pump status group, permitted pumps outnumber non-permitted pumps.

# Filter out observations with NA
permits <- training.data %>% select(permit, status_group) %>%
              filter(! is.na(permit))

ggplot(permits, aes(x=permit, fill = permit)) + geom_bar(alpha = 1/2) + 
  scale_fill_brewer(palette="Set1") + facet_grid(~status_group) +
  theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
  xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Permit"))

Extraction Type Class

Could the pump extraction type class contribute to the pump status group? As exhibited in the following plot, gravity and handpump extraction have a significantly higher level of functional pumps. In contrast, extraction types of motorpump and other cleary are less reliable.

ggplot(training.data, aes(x=extraction_type_class, fill = extraction_type_class)) + 
  geom_bar(alpha = 1/2) + scale_fill_brewer(palette="Set1") + facet_grid(~status_group) +
  theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
  xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Extraction Type Class"))

Water Quality Group

Is there a relationship between the water quality group and pump status? The following plot demonstrates that the overwhelming majority of pumps are classified as having good water. Only in the case of the water quality group of “unknown”, are there more non-fuctional pumps than functional pumps.

ggplot(training.data, aes(x=quality_group, fill = quality_group)) + 
  geom_bar(alpha = 1/2) + scale_fill_brewer(palette="Set1") + facet_grid(~status_group) +
  theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
  xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Water Quality"))

Water Quantity

The quantity of water available to a pump, or the lack thereof, could render a pump non-functional. As the following plot demonstrates, pumps where the water quantity category is dry, have the lowest level of functional pumps.

ggplot(training.data, aes(x=quantity, fill = quantity)) + 
  geom_bar(alpha = 1/2) + scale_fill_brewer(palette="Set1") + facet_grid(~status_group) +
  theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
  xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Water Quantity"))

Water Source

A visual inspection of the water source shows that spring and shallow well account for the majority of water sources. In addition, upon visual inspection, both categories have the most functional and non-functional pumps.

# Extend the Set1 palette
#set1.plus1 <- colorRampPalette(brewer.pal(9, "Set1"))(10)

# ggplot(training.data, aes(x=source, fill = source)) + 
#   geom_bar(alpha = 1/2) + scale_fill_manual(values = set1.plus1) + facet_grid(~status_group) +
#   theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
#   xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Water Source"))

ggplot(training.data, aes(x=source, fill = status_group)) + 
  geom_bar(alpha = 1/2) + scale_fill_brewer(palette="Set1") +
  theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
  xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Status Group"))

Water Point Type

Examining the water point types provide a discernible majority in the communal standpipe and hand pump categories. Across all categories, only communal standpipe multiple and other have more non-functional pumps compared to functional pumps.

ggplot(training.data, aes(x=waterpoint_type, fill = waterpoint_type)) + 
  geom_bar(alpha = 1/2) + scale_fill_brewer(palette="Set1") + facet_grid(~status_group) +
  theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
  xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Water Point Type"))

Inspect Continuous and Discrete Variables

The next few plots compare continuous and discrete variables by pump status groups.

Population

The popoluation attribute only contains 39,019 (64.0%) observations with a value greater than zero. In addition the data is signifcantly skewed. Taking the natural log of the population aids in correcting the skew into a somewhat normal bell shape. Future iterations will address options for the observations containing zero.

# Filter out observations with NA
population <- training.data %>% select(population, status_group) %>%
              filter(! is.na(population))
dim(population)
## [1] 38019     2
ggplot(population, aes(x = log(population))) + 
  geom_histogram(bins = 30, fill="#377EB8", alpha = 1/2) +
  facet_grid( ~ status_group) + xlab("Population") + ylab("")

Total Static Head Amount (amount water available to waterpoint)

The total static head amount variable also contains highly skewed value and a large amount of outliers.

ggplot(training.data, aes(x=status_group, y=amount_tsh)) + 
  geom_boxplot(fill="#377EB8") + coord_flip() + xlab("") + ylab("")

Construction Year

Each pump status displays a left-skewed histogram. Logically, this is reasonable since more pumps will exist over time as populations increase.

# Filter out observations with NA
constr.year <- training.data %>% select(construction_year, status_group) %>%
              filter(! is.na(construction_year))

ggplot(constr.year, aes(x = construction_year)) + 
  geom_histogram(bins = 20, fill="#377EB8", alpha = 1/2) +
  facet_grid( ~ status_group) + xlab("Year") + ylab("")

GPS Height

ggplot(training.data, aes(x = gps_height)) + 
  geom_histogram(bins = 30, fill="#377EB8", alpha = 1/2) +
  facet_grid( ~ status_group) + xlab("Height") + ylab("")

References

Wikimedia Commons. (2012). Tanzania regions. Retrieved from https://commons.wikimedia.org/wiki/File:Tanzania_regions.svg